Chapter I. Introduction and Motivation

The use of digital computers to process images has rapidly increased.
It has been estimated [1] that within the next ten years a potential digital
image processing market of 400 million frames per year could develop. This
market will depend on how powerful digital image processing techniques
become and whether they can be implemented in a cost-effective manner.
Although the cost of computing power has decreased, disregarding inflation,
it is felt that a specialized computer architecture tailored to image
processing might give at least two orders of magnitude gain in cost
effectiveness. As will be discussed in a later section, the throughput of an
image processor, depending on the algorithm and the size of the image, can be
several orders of magnitude greater than that of a general purpose computer
doing the same operation.

Image processing in its most general sense consists of a set of operations
performed on an input image to produce an output. However, there is no
consensus on what constitutes the set of operations or what the desired
output is.

Image processing, in general, can be separated into three areas: image
digitization and coding, image enhancement and restoration, and image
segmentation and description. The first area consists of the conversion of an
image from a continuous to a discrete form and the compression of this
information so as to conserve storage or channel capacity. The second area,
enhancement and restoration, deals with improving the quality of an image
that has been degraded by noise, blurring, lack of contrast, or geometric
distortions. The third area, segmentation and description, is perhaps the
most difficult of the three. It involves the measurement of properties, or
features, of the image or parts thereof and the classification or
description of the image in terms of these properties.

The  motivation  of  this  research  is  based  on  a number  of  factors.  First 
of  all,  image  processing  differs  from  conventional  data  processing  in  three 
major  aspects: 

(1) two-dimensional image data arrays require large amounts of
storage;

(2) high on-line processing rates are needed for real-time applications;
and

(3) operations to be performed on the data are usually highly parallel
in nature.

Processing of image data by conventional general purpose computers,
those based on the so-called Von Neumann architecture [2], requires
enormous amounts of computing time. The unsuitability of this architecture
for these tasks, which results in high computing costs, is based on several
factors [3]. Programs and data are stored in the same memory unit and all
operations are executed serially. Due to the large amount of image data,
main memory capacity is usually exceeded, resulting in large amounts of
overhead time to transfer data between secondary storage and main memory.
Finally, input/output transfers are slow since the central processor must
initiate the transfer of each word. Therefore, by developing a specialized
computer architecture that is optimized for image processing, real-time
cost-effective image processing may be possible.

The second motivation is due to the technological advances that have
been made in integrated circuits, along with the "birth" of microprocessors.
Since microprocessors are becoming less expensive and more powerful, it is
possible to use them as a means of providing a large, cost-effective
computational capability for image processing.

With the above motivations in mind, the following questions will be
discussed.

(1) In what configuration should a multi-microprocessor system be
organized?

(2) How should the memory be allocated?

(3) How many microprocessors are needed to surpass the performance
of a conventional computer like the PDP-11/70?

(4) How would such a system compete in terms of performance against
a cellular logic array built for image processing?

(5) What advantages are there in using a multiprocessing system?

In Chapter II, an approach to image processing hardware based on the
concept of cellular logic arrays will be evaluated. A study by Duff,
Cordella, and Levialdi [4] comparing the parallel and sequential processing
of images will be investigated in Chapter III. In Chapter IV, another
approach to image processing hardware, a partitionable
multi-microprogrammable microprocessor system, will be examined. In
Chapter V, performance issues will be investigated. Finally, areas for
further research will be discussed in Chapter VI.

Chapter II. Image Processing Cellular Logic Arrays

A cellular logic array processor consists of an array of cells or
subprocessors, where each cell is assigned to a pixel (an abbreviation for
"picture element"). Each cell receives inputs from an external data source
and from neighboring cells. A cell contains Boolean logic to process the
inputs and storage for saving intermediate results prior to producing a
final output. Grey level images are stored as encoded bit planes. For
example, an 8 level grey scale image can be stored in 3 bit planes. Two
outputs are available from each cell. One connects with neighboring cells
and the other is used to output the processed pattern.
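
As a rough illustration of bit-plane storage, the sketch below splits a small
grey level image into binary planes and reassembles it. The function names and
the sample image are illustrative assumptions and do not come from any of the
machines described in this chapter.

```python
def to_bit_planes(image, levels):
    """Split a grey level image (list of rows of ints) into binary planes.

    Plane k holds bit k of every pixel, so an image with `levels` grey
    levels needs (levels - 1).bit_length() planes.
    """
    n_planes = max(1, (levels - 1).bit_length())
    return [[[(pixel >> k) & 1 for pixel in row] for row in image]
            for k in range(n_planes)]


def from_bit_planes(planes):
    """Rebuild the grey level image from its bit planes."""
    rows, cols = len(planes[0]), len(planes[0][0])
    return [[sum(planes[k][r][c] << k for k in range(len(planes)))
             for c in range(cols)] for r in range(rows)]


# Hypothetical 8 level image (values 0..7), stored in 3 planes.
image = [[0, 3, 7],
         [5, 2, 6]]
planes = to_bit_planes(image, levels=8)
restored = from_bit_planes(planes)
```

Each plane is itself a binary image, which is the form a cell with Boolean
logic can operate on directly.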

The structure of a cellular logic array is best suited for the
performance of "local operations" on image neighborhoods. A local operation
on an image defines a value for each pixel in the transformed image in terms
of its own value and a small set of its neighbors. Since each pixel is
assigned to a cell, all the values of the new pixels in the transformed image
can be computed in parallel.
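
A local operation of this kind can be sketched in a few lines. The example
below is an illustration only: the 3 by 3 neighborhood, the treatment of
pixels outside the image as 0, and the `expand` rule are assumptions chosen
for the sketch, and a sequential loop stands in for the parallel array.

```python
def local_operation(image, rule):
    """Apply `rule(center, neighbors)` to every pixel of a binary image.

    Each new value depends only on the old image, so all pixels could be
    computed at once on an array of cells; here they are looped over.
    Pixels outside the image are assumed to be 0.
    """
    rows, cols = len(image), len(image[0])

    def pixel(r, c):
        return image[r][c] if 0 <= r < rows and 0 <= c < cols else 0

    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    return [[rule(image[r][c], [pixel(r + dr, c + dc) for dr, dc in offsets])
             for c in range(cols)] for r in range(rows)]


def expand(center, neighbors):
    """Set a pixel if it or any of its eight neighbors is set."""
    return int(center or any(neighbors))


image = [[0, 0, 0, 0],
         [0, 1, 0, 0],
         [0, 0, 0, 0]]
expanded = local_operation(image, expand)
```

Because `rule` reads only the original image, the order in which pixels are
visited does not affect the result, which is what makes the operation
suitable for one cell per pixel.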

The optimization of a cellular logic array structure is based on three
factors:

(1)  interconnection  pattern, 

(2)  internal  logic,  and 

(3)  internal  storage. 

An array which has a rich interconnection pattern, powerful internal
logic, and large amounts of storage will be expensive but will be capable
of fast and sophisticated processing. On the other hand, if the array is
sparsely interconnected with minimal logic and storage, each cell will be
simple and inexpensive but will have little processing capability.

Several image processors have been designed which are based on a cellular
logic approach. The Illiac III [5,6] is a digital computer which was
designed for automatic scanning and analysis of homogeneous image data, e.g.
bubble-chamber negatives. The system consisted of 3 parts: image
acquisition/display, image encoding for information transmission, and
classification of the encoded image. A schematic of the computer is shown in
Figure (1). It was recognized in the initial development of the system that
the data rate for local image processing would create a system bottleneck.
Therefore, a special processor was designed to permit the rapid recognition
and description of an input image.

This unit, the Pattern Articulation Unit (PAU), was to perform local
preprocessing on the input image, such as track thinning, gap filling, line
element recognition, and so forth. The logical design was optimized for the
idealization of the input image to a line drawing. Nodes representing end
points, points of inflection or intersection, among others, were labeled in
parallel under the program control of a unit called the Taxicrinic Processor
(TP). The output of the PAU is a graph, in a list structure, which describes
the interconnection of the labeled nodes.

The Taxicrinic Processor assembles these graphs into a coherent list
structure subject to a recognition grammar. This grammar categorizes the
graph, thus recognizing the contents of the original image. In addition to
controlling the operation of the PAU, the TP also oversees the operation of
the arithmetic units and initiates input/output operations. The Arithmetic
Unit is used for any mathematical operations needed, such as for statistical
analysis.

array. An input image, which is typically 1024 by 1024 pixels, is
partitioned into a lattice of 32 by 32 pixel windows. The image is processed
serially by window and in parallel within the window. Within the window, an
image is described by a set of bit planes.
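
The window partitioning can be made concrete with a short sketch. The raster
visiting order and the function name are assumptions for illustration; the
text states only the image and window sizes.

```python
def windows(image_size=1024, window_size=32):
    """Yield the (row, col) origin of each window, in raster order."""
    for row in range(0, image_size, window_size):
        for col in range(0, image_size, window_size):
            yield row, col


origins = list(windows())
n_windows = len(origins)                 # windows visited serially
pixels_per_window = 32 * 32              # pixels handled in parallel
```

With these sizes the array is stepped through 1024 window positions, each
covering 1024 pixels at once.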

In Figure (2), the structure of a stalactite is shown in simplified
form. Each element can accept an input from itself and from any of the eight
neighboring elements in the plane. The input signals are ORed together,
optionally complemented, and stored in one or more of the nine memory planes.
Communication with extra planes in the core buffer is through the H plane,
which serves as the buffer register of this memory. The outputs of any
selected set of planes can be ANDed or ORed to get an output signal which
can be optionally complemented. This signal is then passed on to neighboring
stalactites. In addition, there is a signal path which allows an input
signal to pass through the element without interim storage. This feature
allows path building with the array.
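
The data path just described can be sketched as three small functions. This
is a behavioral illustration under stated assumptions, not a model of the
actual hardware: the function names, the use of bit lists to select inputs
and planes, and the plane numbering are all hypothetical.

```python
def combine_inputs(own, neighbors, select, complement=False):
    """OR together the selected inputs (own signal first, then 8 neighbors)."""
    signals = [own] + list(neighbors)
    value = int(any(s for s, use in zip(signals, select) if use))
    return 1 - value if complement else value


def store(planes, value, targets):
    """Write the combined value into one or more of the nine memory planes."""
    for index in targets:
        planes[index] = value


def element_output(planes, selected, use_and=True, complement=False):
    """AND or OR the selected planes, then optionally complement."""
    bits = [planes[i] for i in selected]
    value = int(all(bits)) if use_and else int(any(bits))
    return 1 - value if complement else value


planes = [0] * 9                                  # nine memory planes
value = combine_inputs(own=0, neighbors=[0, 1, 0, 0, 0, 0, 0, 0],
                       select=[1] * 9)
store(planes, value, targets=[0, 2])
out_and = element_output(planes, selected=[0, 1], use_and=True)
out_or = element_output(planes, selected=[0, 1], use_and=False)
```

The pass-through path mentioned in the text would correspond to forwarding
`value` to the neighbors without calling `store`.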

A common set of control lines is connected to each stalactite.
Thirty-one instructions are provided. They can be grouped into categories
concerned with (1) loading an input image string into the array or
generating an output image string, (2) planar transfers of information,
(3) redefining the value of a pixel in a plane on the basis of its neighbors
either in the same plane or in corresponding positions in neighboring
planes, and (4) the input or output of associatively derived coordinate
information.

On a historical note, the Illiac III computer was never completed. A
fire in 1967 destroyed the PAU and the main frame of the computer. At a
later time, the project was abandoned.

CLIP, an acronym for Cellular Logic Image Processor, is the name of a
family of cellular logic arrays that have been proposed and constructed at
University College London. These processors have been built in an attempt
to provide a general purpose hardware facility for image processing
studies.

The CLIP family consists of four different processors, each with
different capabilities. The CLIP 1 [7] served to test the feasibility of
constructing an integrated circuit array in which propagation of signals
into and from all parts of the array might take place. Three functions could
be implemented: extraction of connected objects, extraction of object
boundaries, and extraction of the contents of closed loops.

The CLIP 2 [8] was the first programmable array to be constructed. It
consisted of a 16 by 12 hexagonal array of 192 symmetric Boolean operators.
Each cell had two inputs: $A_0$, the value of the input pattern at the
cell, and $A_1$, the output from a NAND gate. The inputs to the NAND gate
are the $A_1^1$ outputs from the six neighboring cells in the array. These
binary inputs are transformed by the cell into two independent binary
outputs, $A_0^1$ and $A_1^1$. $A_0^1$ is the processed pattern and $A_1^1$
connects with the six neighboring cells. Figure (3a) is a logic diagram of
the cell. Since communication between the cell and its neighbors is through
a NAND gate, the source of the signal is never identifiable. Thus, the cell
is non-directional in character.
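
The non-directional property follows from the symmetry of the NAND function
and can be checked with a short sketch. The function name and the sample
neighbor values are assumptions for illustration.

```python
from itertools import permutations


def interconnection_input(neighbor_outputs):
    """NAND of the interconnection outputs of the six neighbors."""
    return int(not all(neighbor_outputs))


# One neighbor outputs 0; try every assignment of that 0 to a direction.
outputs = [1, 1, 0, 1, 1, 1]
values = {interconnection_input(p) for p in permutations(outputs)}
all_high = interconnection_input([1] * 6)
```

Every rearrangement of the six neighbor signals gives the same value, so the
cell can tell that some neighbor output is 0 but not which one.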

In Figure (3b), the layout of the complete system is shown. An input
pattern $A_0$ is stored in a 192 bit shift register as a sequence of 0 and
1 logic states, representing the white and black parts of the original
image. This pattern can be circulated and displayed on an oscilloscope. The
output pattern $A_0^1$ appears on the 192 output leads from the array and is
stored in the output shift register. The contents of this register can also
be displayed on the oscilloscope. In addition to these two registers, a
third 192 bit memory is provided for storage of a pattern while another
pattern is being processed.

The CLIP 2 system is controlled by means of a 12 bit instruction word.
There are two types of instructions, LOAD and PROCESS. The LOAD instructions
are used to cycle the input and output memories, to load and display their
contents, and to transfer the contents of either memory or both. The PROCESS
instructions set the values of the control lines, the positions of the
routing switches and the gate switches, and set up a logical 0 or 1 on the
interconnection leads which extend outside the array.

Up  to  32  instructions  can  be  stored,  so  that  sequences  of  instructions 
can  be  executed.  The  instructions  are  entered  either  by  12  switches  or  by 
punched  tape. 

Since the CLIP 2 cell represents a compromise over a completely general
cell, the versatility of this processor was severely limited by the nature
of the cell interconnections. In the general cell, there would be nine
inputs, two outputs and a total of 1024 control lines. In the CLIP 3 [9,10],
a compromise was achieved which allowed implementation of all the functions
of the general cell by means of short sequences of functions, but which did
not require the complexity of the general cell. This cell, as shown in
Figure (4a), is similar to that of the CLIP 2. It is preceded by a threshold
unit fed by the 8 neighbor inputs; each input is individually gated into the
threshold unit. The threshold unit has 3 control inputs which set the
threshold level, and 8 control inputs which select a subset of the neighbor
inputs to be summed and thresholded. The processor requires 8 control lines
as before. Therefore, 19 control lines are required to determine the
function to be performed.
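
The control line counts quoted above are consistent with a simple counting
argument, sketched here under the assumption (not stated in the text) that
each output is specified by a full truth table with one control line per
table entry.

```python
def truth_table_lines(n_inputs, n_outputs):
    """Control lines needed to select any Boolean function for each output."""
    return n_outputs * 2 ** n_inputs


general_cell = truth_table_lines(n_inputs=9, n_outputs=2)   # 2 * 512
processor = truth_table_lines(n_inputs=2, n_outputs=2)      # 2 * 4

threshold_level_lines = 3    # set the threshold level
neighbor_gate_lines = 8      # gate each of the 8 neighbor inputs
clip3_total = processor + threshold_level_lines + neighbor_gate_lines
```

Under this assumption the general cell needs 1024 lines and the CLIP 3 cell
needs 8 + 3 + 8 = 19, matching the figures in the text.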

The array structure comprises 192 cells, as in the CLIP 2. However, the
array interconnection pattern can be either square, with eight connections
to each cell, or hexagonal, with six connections to each cell. The choice of
interconnection is under control of the programmer.

The two patterns presented to the processor are held in the A and B
registers. Processed patterns, represented by D, are output into one of 16
memories. The interconnection output N appears as an input to the threshold
gates of the neighboring cells. The thresholded sum of the inputs is T.
The OR gate forms the logical sum B+T, which, together with A, provides the
input to the processor. Patterns are input into the system and the results
are output from the system via the A register. Instructions to the processor
are stored in a 256 word RAM. The instructions are 24 bits long and are of 5
types: PROCESS, LOAD, and BRANCH instructions and two special BRANCH
instructions for entering or leaving subroutines.
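
The signal flow through a CLIP 3 style cell can be sketched as follows. This
is an illustration under several assumptions that the text does not confirm:
the threshold comparison is taken to be "sum greater than or equal to the
level", and the 8 processor control lines are modeled as two 4-entry truth
tables, one for D and one for N. All names are hypothetical.

```python
def clip3_cell(a, b, neighbors, gates, threshold, d_table, n_table):
    """One evaluation of a CLIP 3 style cell (illustrative only).

    a, b      : bits from the A and B registers
    neighbors : the 8 interconnection inputs from adjacent cells
    gates     : 8 bits selecting which neighbor inputs are summed
    threshold : level set by the 3 threshold control inputs (0..7)
    d_table,
    n_table   : 4-entry truth tables (8 control bits in all) giving the
                D and N outputs for each (a, b_or_t) input pair
    """
    total = sum(n for n, g in zip(neighbors, gates) if g)
    t = int(total >= threshold)      # assumed comparison: sum >= level
    p = b | t                        # the OR gate forms B+T
    index = (a << 1) | p
    return d_table[index], n_table[index]


AND_TABLE = [0, 0, 0, 1]
OR_TABLE = [0, 1, 1, 1]
neighbors = [1, 1, 0, 0, 0, 0, 0, 0]

d_low, n_low = clip3_cell(1, 0, neighbors, [1] * 8, 2, AND_TABLE, OR_TABLE)
d_high, n_high = clip3_cell(1, 0, neighbors, [1] * 8, 3, AND_TABLE, OR_TABLE)
```

Raising the threshold from 2 to 3 switches T off for this neighborhood, which
changes the D output while N stays set because A is 1.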

An obvious limitation of the CLIP 3 is the low resolution obtained with
the small 16 by 12 cell array. Also, the number of grey levels that can be
handled is limited by the number of D-arrays available at any given time.
Because of these limitations, a hybrid CLIP 3 array was built to gain
experience with larger image areas. A scanning unit was designed which
interfaces the CLIP 3 with a television camera. Provision is made to
threshold the video signal and digitize a 96 by 96 array. The unit scans the
192 cell CLIP 3 array across the 96 by 96 data field and provides storage to
handle signals that propagate between sectors. The scanning process is
complicated by the fact that a forward scan must be followed by a reverse
scan to allow for propagations in all directions and a check scan to
determine that all propagations have taken place. The complete system is
interfaced to a PDP-11/10 computer which provides data and instruction
storage and also provides program editing and assembling facilities.

Using the results obtained from the CLIP 3 and the hybrid version, an
NMOS LSI processor called the CLIP 4 [11,12] has been built. It is a 96 by
96 cell array with several changes made in the basic cell design. The
structure of the basic cell is shown in Figure (4b). The D storage has been
increased to 32 bits; the interconnection threshold gate has been replaced
by an OR gate; and a few extra gates and an additional buffer to provide an
automatic carry for arithmetic operations have been included. The use of
NMOS LSI will cause a speed reduction in the CLIP 4 by a factor of at least
five compared to the CLIP 3, which used MSI TTL.

Chapter III. Evaluation of the Cordella, Duff, and Levialdi Study

A study performed by Cordella, Duff, and Levialdi [4] comparing the
sequential and parallel processing of images is discussed in this chapter.
The sequential processing was done on an HP2116 minicomputer. The parallel
processing was done on a hypothetical processor, based on the performance of
the CLIP 3.

The tasks investigated were of the type used during the preprocessing
phase of a pictorial pattern recognition procedure. These tasks include
smoothing, thresholding, contour extraction, thinning, and perimeter
evaluation. An assumption made in the study was that the images had a
maximum of 32 grey levels.

A number of serious problems were found in the evaluation of their
study. The first concerns the use of clock cycles as a basis of comparison.
Unless the time for a clock cycle is equal in both processors, the results
are not a valid indication of how long a task takes. For example, consider
the case where Processor A has a clock cycle time of 1 microsecond and
Processor B has a clock cycle time of 10 microseconds. If both processors
use the same number of clock cycles for a task, then Processor B will take
10 times as long as Processor A to complete the same task. However, their
study would not show any difference.
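
The point can be stated as a one-line formula: elapsed time is the cycle
count multiplied by the cycle time. The sketch below works through the
Processor A and Processor B example; the cycle count of 5000 is a
hypothetical value chosen only for illustration.

```python
def execution_time_us(clock_cycles, cycle_time_us):
    """Elapsed time in microseconds: cycle count times cycle time."""
    return clock_cycles * cycle_time_us


cycles = 5000    # hypothetical cycle count, identical on both machines
time_a = execution_time_us(cycles, cycle_time_us=1)
time_b = execution_time_us(cycles, cycle_time_us=10)
ratio = time_b / time_a
```

A comparison based on `cycles` alone reports the two machines as equal, while
the elapsed times differ by the ratio of the cycle times.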

The second problem area involves how realistic the study is in terms of
present or near future technology. In their study, they used the performance
of the CLIP 3 as a basis of comparison. However, it is not feasible at this
time to build large cellular arrays with this speed performance. For
example, using the MSI TTL technology from which the CLIP 3 was built, a 96
by 96 array would require 16 racks, each 6 feet tall by 19 inches, plus the
space required for power supplies. As discussed in the previous chapter,
building a 96 by 96 array required switching to an LSI NMOS technology.
This resulted in a factor of five loss in speed for LOAD/PROCESS
instructions and a factor of ten loss in speed for propagation. This speed
loss is due to the fact that gate propagation delay in NMOS is much greater
than in TTL. Although it is possible to place approximately 1000 gates on an
LSI chip as compared to 100 for MSI, it is still only possible to place 8
cells on one chip. Thus, a 7 foot tall rack is required to hold the entire
array.
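
The scale of the array and the effect of the two slowdown factors can be
worked through with a little arithmetic. The chip count follows directly from
the figures in the text; the instruction mix used for the slowdown estimate
is a hypothetical assumption.

```python
ARRAY_SIDE = 96
CELLS_PER_CHIP = 8           # figure quoted in the text for NMOS LSI

cells = ARRAY_SIDE * ARRAY_SIDE
chips = cells // CELLS_PER_CHIP


def nmos_time(ttl_process_time, ttl_propagation_time):
    """Scale TTL timings by the slowdown factors quoted in the text."""
    return 5 * ttl_process_time + 10 * ttl_propagation_time


# Hypothetical operation: 2 time units of LOAD/PROCESS work plus
# 1 time unit of propagation on the TTL machine.
slowdown = nmos_time(2, 1) / (2 + 1)
```

For any mix of the two kinds of work, the overall slowdown lies between the
factors of five and ten, depending on how much of the task is propagation.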

As mentioned before, Cordella et al. assumed that the images were
limited to 32 grey levels. However, for most image processing situations,
128 or 256 grey levels are more realistic. In fact, it was found that by
extending the equation they derived for thresholding [13] to arbitrary grey
levels, for a 50 by 50, 256 grey level image, assuming equal cycle times, it
is faster to compute the threshold on a sequential computer (Table (1)) than
on a cellular logic processor. This is because CLIP operations are low
level, so overhead results when performing higher level operations. The
overhead becomes significant when a small image is processed.

Therefore, considering the problems presented in this chapter, the
tremendous speed gains (Figure (5)) determined in their study must be
viewed in the proper perspective.

Chapter IV. Partitionable Multi-Microprogrammable Microprocessor System

This chapter discusses another approach to image processing hardware.
Instead of having a Boolean processor for each pixel in the image as in the
Illiac III or CLIP processors, a sophisticated, powerful processor is used
for each subsection of the image. Computations within each subsection of the
image are performed sequentially; however, since all subsections are
processed at once, the computation of the algorithm is effectively done in
parallel.

If the size of each subsection of the image is allowed to shrink to a
single pixel, the result is a cellular logic array computer of the kind
discussed in Chapter II. A specialized computer of this form has been
proposed for the numerical solution of problems in fluid mechanics [14,15].
The computer described has a fixed interconnection network where each cell
can only communicate with its nearest neighbors "above" and "below" and to
the "right" and "left".

The approach taken in this paper seeks a balance between organizational
complexity and performance. Since this system is being developed for a
particular application, some generality can be sacrificed for an improved
performance based on the specific requirements of the application.
Figure (6) shows the overall block diagram of the system. Some of the goals
of this design are as follows:

(1)  use  multiple  microprocessors  operating  in  parallel  to  achieve 
high  performance  in  a cost-effective  manner, 

(2)  provide  high  speed  input/output, 

(3)  have  a flexible  memory  system  which  is  structured  to  minimize 
memory  contention,  and 

(4) design an interconnection network that allows partitioning of a
group of microprocessors into smaller groups where each subgroup can
work on a different task.

The system can be broken down into the following parts: Memory, Memory
Management and I/O Processor, Interconnection Network, Microprocessors, and
Sequential Controller. Each part will be discussed in the next five
sections.

Memory

Memory in this system is divided into two types, data memory and
instruction memory. Data memory is split up into N separate modules, where N
is the number of microprocessors in the system. Data is allocated to each
module in row major form. This is illustrated in Figure (7). Each module
contains sufficient memory to store the original subsection of the image as
well as storage for the resulting computation, temporary variables, and a
buffer to allow communication between microprocessors. The exact amount of
storage will be a function of the largest image to be processed and the
number of microprocessors in the system. By distributing the data in row
major form, the number of different memory modules with which each
microprocessor must communicate can be minimized. This, in effect, reduces
the overhead necessary to control the interconnection network between
microprocessors and memory. The address mapping function of the memory
management and I/O processor is simplified since the data is mapped into
sequential locations in each memory module. A high I/O bandwidth can be
obtained since all I/O transfers will be done using direct memory access. It
is also possible to use a parallel head disk arrangement for secondary
storage which will yield fast transfer times.
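
The row major allocation can be illustrated with a small address mapping
sketch. It assumes that each module holds a block of complete, consecutive
image rows and that the number of rows divides evenly among the modules; the
function name and the one-location-per-pixel layout are assumptions for
illustration.

```python
def module_for_pixel(row, col, image_rows, image_cols, n_modules):
    """Return (module number, offset within module) for pixel (row, col).

    Assumes image_rows is a multiple of n_modules, so every module holds
    the same number of complete, consecutive image rows.
    """
    rows_per_module = image_rows // n_modules
    module = row // rows_per_module
    offset = (row % rows_per_module) * image_cols + col
    return module, offset


# Hypothetical configuration: 256 by 256 image, 64 memory modules.
first = module_for_pixel(0, 0, 256, 256, 64)
last_in_module_0 = module_for_pixel(3, 255, 256, 256, 64)
first_in_module_1 = module_for_pixel(4, 0, 256, 256, 64)
```

Because consecutive pixels map to consecutive offsets, a block of rows can be
moved into a module with a single sequential transfer, and a processor
working on a 3 by 3 neighborhood needs data from at most the modules directly
above and below its own.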

Figure (7) Example of data distributed to memory modules in row major form

A possible modification of the data memory modules can be made. In this
arrangement, each memory module consists of two submodules where it is
possible to access data from only one submodule at a time. While operations
are being performed on the data in one submodule, new data can be input to
the other submodule or results of a previous process can be output. Thus,
input/output can be overlapped with the computational process. This type of
arrangement would be useful in a full production mode of operation. This
mode would correspond to the situation where the same operations are being
performed on many images of the same type and high throughput is needed. An
example of this would be an automated white blood cell count system. In a
research oriented environment, this modification would not be very useful.
This is due to the highly experimental and interactive nature of the
computations being performed.
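
The two-submodule arrangement is a form of double buffering, which can be
sketched as follows. The class and method names are hypothetical, and the
sequential calls below only mimic the overlap that the hardware would achieve
by running input/output and computation at the same time.

```python
class DoubleBufferedModule:
    """Memory module with two submodules used in ping-pong fashion."""

    def __init__(self):
        self.submodules = [None, None]
        self.active = 0            # submodule currently seen by the processor

    def load_next(self, data):
        """I/O side: fill the submodule that is not being processed."""
        self.submodules[1 - self.active] = data

    def swap(self):
        """Exchange the roles of the two submodules."""
        self.active = 1 - self.active

    def process(self, operation):
        """Processor side: transform the active submodule in place."""
        self.submodules[self.active] = operation(self.submodules[self.active])
        return self.submodules[self.active]


def double(pixels):
    return [2 * p for p in pixels]


module = DoubleBufferedModule()
module.load_next([1, 2, 3])        # first image arrives
module.swap()
module.load_next([4, 5, 6])        # second image loads during processing
first = module.process(double)
module.swap()
second = module.process(double)
```

In a production run the same operation is applied to every image, so the
processor never has to wait for a transfer to finish before starting on the
next image.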

The instruction memory is partitioned into four modules. Each module
consists of memory along with sequencing hardware which is used to broadcast
instructions to a partitioned set of microprocessors. These modules are
loaded by the sequential controller through a single bus. If the system is
to operate with a single group of N microprocessors, then all four modules
are loaded in parallel with the same instructions. If two tasks are to be
executed, each with N/2 microprocessors, then two modules are loaded with
the instructions for one task, followed by the other two modules being
loaded with the instructions for the other task. If four separate tasks are
to be executed, then each module is loaded with a different task.
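
The three loading patterns can be summarized in a short sketch. The function
name is hypothetical, and the choice of which two modules share a task in the
two-task case (adjacent module numbers) is an assumption; the text says only
that two modules are loaded for each task.

```python
def load_plan(n_tasks, n_modules=4):
    """Map each of the four instruction modules to a task number.

    n_tasks must be 1, 2 or 4, matching the partitions described above.
    """
    if n_tasks not in (1, 2, 4):
        raise ValueError("supported partitions are 1, 2 or 4 tasks")
    modules_per_task = n_modules // n_tasks
    return [module // modules_per_task for module in range(n_modules)]


one_task = load_plan(1)
two_tasks = load_plan(2)
four_tasks = load_plan(4)
```

Modules that share a task number receive the same instructions and can be
loaded together over the single bus.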


Memory Management and I/O Processor

The memory management and I/O processor (MMIO) is responsible for the
allocation of data to each memory module, control of the direct memory
access process, and all input/output interfaces. There are three modes of
data allocation and transfer possible for the MMIO.

The first mode of operation consists of a sequential access to the
memory modules. This mode is best described in terms of an example. Suppose
data from a slow-scan (15 field/second) television camera scanner [16] is to
be input to the system. It is assumed that the image is 256 by 256 pixels
and that there are 64 microprocessors in the system. The MMIO processor will
operate in the following manner. Upon command from the Sequential
Controller, the MMIO processor first determines that 4 rows of 256 pixels
each are to be allocated to each memory module. It then sets up a direct
memory transfer for 1024 bytes. The memory locations where the 1024 bytes of
information are to be stored are determined by the mode, the internal
starting address, and the memory module number. The actual hardware
implementation of this will be given following the description of the other
modes. In the scanner, data is sampled and digitized on a line by line basis
with 256 pixels obtained per horizontal sweep. The scanning process is
initiated by the MMIO and coincides with the beginning of a new television
field. This beginning of a field is denoted by a vertical synch pulse
preceding a horizontal synch pulse. Once 1024 pixels have been input to the
memory module, the MMIO processor changes the address to the next memory
module, sets up a block transfer of 1024 bytes, and reinitiates the scanning
process upon receipt of the next horizontal synch pulse. All processor
changes are made during the horizontal retrace time of the slow-scan
television camera, which amounts to approximately 50 microseconds.
Therefore, it is possible to input a 256 by 256 image in 1/15 of a second,
the time required to scan one television field.

The second mode of operation allows for the broadcasting of input data to
partitioned subsets of the memory modules. In terms of the example just
discussed, suppose the system is partitioned into two subsets of processors,
so that each subset holds 32 memory modules and 8 rows of 256 pixels are
allocated to each memory module. In this mode of operation a block transfer
of 2048 bytes would then be made to memory module (i) and memory module (i+1)
simultaneously, where i and i+1 represent the module numbers.
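
A minimal sketch of the broadcast bookkeeping follows. It assumes one byte
per pixel and that consecutive module numbers belong to different groups, so
that modules i, i+1, ... receive the same block; the function name and tuple
layout are hypothetical and chosen only for illustration.

```python
def mode2_schedule(groups, image_rows=256, row_pixels=256, modules=64):
    """Plan broadcast transfers when the modules are split into `groups`.

    Assumes one byte per pixel and that consecutive module numbers fall in
    different groups, so one block is written to `groups` modules at once.
    """
    if modules % groups != 0:
        raise ValueError("modules must divide evenly into groups")
    modules_per_group = modules // groups
    if image_rows % modules_per_group != 0:
        raise ValueError("rows must divide evenly among a group's modules")
    rows_per_module = image_rows // modules_per_group
    block_bytes = rows_per_module * row_pixels
    plan = []
    for k in range(modules_per_group):
        targets = tuple(range(k * groups, (k + 1) * groups))
        # Each entry: (modules written together, first image row, bytes).
        plan.append((targets, k * rows_per_module, block_bytes))
    return plan


two_groups = mode2_schedule(2)
print(len(two_groups), two_groups[0], two_groups[1])
```

With two groups the plan contains 32 broadcasts of 2048 bytes; with four
groups, under the same assumptions, it contains 16 broadcasts of 4096 bytes.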

The third mode of operation allows for the parallel transfer of data between
the memory modules and a parallel head disk. This type of transfer is similar
to that used in the Illiac IV computer [17].

Figure (8) indicates a proposed hardware design that allows for the
addressing of the memory modules under the three different modes of
operation. All transfers of data under the first two modes occur over a
tri-state bus. In the third mode, all transfers occur over a multiplexed
parallel bus. This system allows for the partitioning of the microprocessors
and memory modules into one, two or four groups. The addressing of the memory
modules is accomplished through the use of decoders, and decoder outputs are
used as enable lines to the memory modules. Each mode of operation is encoded
into a set of bits which are used to enable a particular decoder. Mode 2 has
two possible bit representations, one for a two group partition and the other
for a four group partition. Mode 3 also has two possible bit representations
since special control is needed for the multiplexed parallel bus. One
representation sets up the transfer from the memory modules to the parallel
head disk, while the other sets up the transfer from the disk to the memory
modules.

Interconnection Network


The emphasis in the design of this system has been to provide the capability
for use of Single Instruction Multiple Data or Multiple Instruction Multiple
Data stream modes. This capability is a function of the versatility of the
interconnection network.

The simplest type of interconnection network is a single bus. In this
structure, all microprocessors and memory modules communicate over the same
pathway. The single bus connection is inexpensive to build and readily
expandable, but has the disadvantage that only one microprocessor is allowed
to send information at a time. This single bus then becomes a bottleneck in
system performance.

On the other end of the scale, the crossbar switch ranks as the most
versatile interconnection scheme. It allows each microprocessor to connect to
any memory module as long as that module is not already connected to another
processor. The cost and complexity of this scheme are prohibitive, since
O(N**2) switches are required to construct such a system when there are N
microprocessors and N memory modules in the system.

It is clear that a practical interconnection network must have many of the
capabilities of a crossbar switch without the enormous cost. Several
interconnection networks that fit into this intermediate category have been
discussed by Siegel [18]. These include the Perfect Shuffle, Cube, Illiac,
Plus-minus 2**i (PM2I), and Wrap-around Plus-minus 2**i (WPM2I). Parallel
processing systems that incorporate some of these interconnection networks
have been constructed [19,20,21].

Each microprocessor has a unique integer in the range 0 to N-1 associated
with it which serves as its address. Similarly, the memory modules can be
addressed by the same range of integer values. Each microprocessor is allowed
to directly access only a subset of the N memory modules. The path between a
microprocessor and a memory module is chosen as a function of the
microprocessor's address bits. Each network uses a different set of functions
of the address bits to choose the interconnection paths. For example, the
Plus-minus 2**i network can be represented by the following 2m functions:

     t_{+i}(j) = (j + 2**i) mod N
     t_{-i}(j) = (j - 2**i) mod N          for 0 <= i < m

where m = log2 N and j is the address of the microprocessor. This
interconnection network allows access to an arbitrary memory module in at
most m steps.
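
The PM2I functions and the bound of m steps can be checked with a short
Python sketch. The function names are hypothetical; the breadth-first search
starts from module 0 only, which suffices because every function is a fixed
offset modulo N and the network therefore looks the same from every address.

```python
from collections import deque


def pm2i_neighbors(j, n):
    """Modules reachable from address j in one step: (j +/- 2**i) mod n."""
    m = n.bit_length() - 1
    if n < 1 or (1 << m) != n:
        raise ValueError("n must be a power of two")
    reachable = set()
    for i in range(m):
        reachable.add((j + (1 << i)) % n)
        reachable.add((j - (1 << i)) % n)
    return reachable


def max_steps(n):
    """Largest number of PM2I steps needed to reach any module from 0."""
    dist = {0: 0}
    queue = deque([0])
    while queue:
        j = queue.popleft()
        for k in pm2i_neighbors(j, n):
            if k not in dist:
                dist[k] = dist[j] + 1
                queue.append(k)
    return max(dist.values())


print(sorted(pm2i_neighbors(0, 8)), max_steps(8), max_steps(64))
```

For N = 8 the one-step neighbors of module 0 are 1, 2, 4, 6 and 7, and every
module is reached in at most 2 steps, within the bound m = 3.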

As mentioned before, the interconnection scheme must have the capability to
partition the microprocessors and memory modules into subsets so multiple
instruction streams can be implemented. The Plus-minus 2**i network is a
possible candidate since it allows partitioning of the microprocessors and
memory modules. For example, as shown in Figure (9), if only functions
t_{+1}, t_{-1}, t_{+2}, and t_{-2} are used, the microprocessors and memory
modules are partitioned into two separate groups. If only the functions
t_{+2} and t_{-2} are used, four separate groups are formed.
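
The partitioning behavior can be illustrated by computing which modules
remain mutually reachable when only some of the PM2I functions are enabled.
The sketch below assumes N = 8 (so m = 3), which is an assumption made for
illustration; `pm2i_groups` is a hypothetical helper, and `allowed` lists the
exponents i whose functions t_{+i} and t_{-i} are enabled.

```python
def pm2i_groups(n, allowed):
    """Group the addresses 0..n-1 by reachability under the enabled functions.

    `allowed` holds the exponents i for which both (j + 2**i) mod n and
    (j - 2**i) mod n may be used.
    """
    seen = set()
    groups = []
    for start in range(n):
        if start in seen:
            continue
        group = set()
        stack = [start]
        while stack:
            j = stack.pop()
            if j in group:
                continue
            group.add(j)
            for i in allowed:
                stack.append((j + (1 << i)) % n)
                stack.append((j - (1 << i)) % n)
        seen |= group
        groups.append(sorted(group))
    return groups


print(pm2i_groups(8, [0, 1, 2]))  # all functions: one group
print(pm2i_groups(8, [1, 2]))     # exponent 0 disabled: two groups
print(pm2i_groups(8, [2]))        # only exponent 2: four groups
```

Disabling the exponent-0 functions separates the even and odd addresses;
keeping only the exponent-2 functions leaves four groups of two modules each.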

Since this entire system is being structured for image processing use,
certain image processing features must also be considered in the choice of
the interconnection network. For example, suppose each memory module contains
only one row of the picture matrix. In order for microprocessor (i) to
perform a local operation based on a 3 by 3 pixel window, it must directly
access memory modules i, i-1, and i+1 to take full advantage of the potential
throughput of the system. When a 5 by 5 window is used, direct access to
memory modules i-2 and i+2 would also be needed. The PM2I and WPM2I
interconnection networks are the only ones presently suggested which allow
direct accesses of this form. However, it is not possible to partition the
WPM2I network.

Figure (9) Partitions available in the PM2I interconnection network
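
The window requirement can be checked against the PM2I offsets directly. The
sketch below is illustrative only: it assumes one image row per memory
module, ignores the wrap-around at the top and bottom rows, and uses a
hypothetical helper name. A row offset is directly accessible when it is 0 or
plus or minus a power of two less than N.

```python
def direct_window_access(window, n):
    """Map each row offset of a window to True if PM2I reaches it directly.

    `window` is the (odd) window side, `n` the number of memory modules,
    assumed to be a power of two with one image row per module.
    """
    m = n.bit_length() - 1
    radius = window // 2
    direct = {0}
    for i in range(m):
        direct.add(1 << i)
        direct.add(-(1 << i))
    return {d: (d in direct) for d in range(-radius, radius + 1)}


print(direct_window_access(3, 64))
print(direct_window_access(5, 64))
```

Every offset needed by the 3 by 3 and 5 by 5 windows is directly accessible.
An offset of 3, as a 7 by 7 window would need, is not a power of two and
would take a second pass under these assumptions.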

Another computation that is sometimes used in image processing is the Fast
Fourier Transform (FFT). Pease [22] and Stone [23] have shown how to compute
this algorithm in parallel using a Perfect Shuffle. This computation is
accomplished through a sequence of operations performed first on pairs of
numbers whose indices, in binary representation, differ by 2**(m-1), then on
those that differ by 2**(m-2), and so forth, to those that differ by 2**0.
The Cube and PM2I interconnection networks allow the same pairing of data, so
it is possible to compute this algorithm directly on any system using one of
these interconnection networks.
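
The pairing can be made concrete with a small sketch. It assumes the stages
pair indices that differ in exactly one bit, taken from the most significant
bit down to the least significant; the function names are hypothetical. The
partner of index j at distance 2**k is j with bit k complemented, which the
PM2I network supplies as (j + 2**k) mod N when bit k of j is 0 and as
(j - 2**k) mod N when it is 1.

```python
def butterfly_partner(j, k):
    """Index paired with j when the pair differs by 2**k (bit k flipped)."""
    return j ^ (1 << k)


def pm2i_partner(j, k, n):
    """The same partner, obtained with a single PM2I function."""
    step = 1 << k
    if j & step:
        return (j - step) % n
    return (j + step) % n


def fft_pairings(n):
    """List (k, pairs) for each stage, from k = m-1 down to k = 0."""
    m = n.bit_length() - 1
    stages = []
    for k in reversed(range(m)):
        pairs = sorted({tuple(sorted((j, butterfly_partner(j, k))))
                        for j in range(n)})
        stages.append((k, pairs))
    return stages


for k, pairs in fft_pairings(8):
    print(2 ** k, pairs)
```

For N = 8 the first stage pairs (0,4), (1,5), (2,6), (3,7) and the last stage
pairs adjacent indices.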

Since the PM2I network fulfills the data manipulation requirements, it will
be used in the design of the system. Its implementation will differ from
previous designs. For example, Feng's data manipulator [19] was implemented
using log2 N stages of PM2I functions. The implementation that will be used
in the system will be a recirculating stage. One pass through this stage will
allow a microprocessor to access a memory module with any of the following
addresses: j, j+2**0, j-2**0, j+2**1, j-2**1, ..., j+2**(m-1), j-2**(m-1)
(all taken mod N), where j is the microprocessor address. Access to an
arbitrary memory module can occur after at most m recirculations through the
stage. Each microprocessor will have a recirculation buffer to facilitate
such transfers.

Figure (10) shows a possible hardware implementation of the recirculation
stage connected to a microprocessor. Only one bit of the data, address and
control buses is shown. Tri-state buses are used extensively.

Control over the interconnection network is viewed on two levels, a logical
level and a physical level. When the system is partitioned into only one
group, both levels coincide. If the system is partitioned into two groups,
the functions t_{+0} and t_{-0} are not allowed at the physical level.
However, on a logical level we would like to use these functions to provide a
consistent view of the interconnection network in terms of our algorithms.
Thus, the algorithms become partition independent. The translation between
the logical level and the physical level can be accomplished in two possible
ways. One way would be during language compilation: an interconnection
program control statement would be compiled based on the global partition
information. Another possible implementation of the control would be through
a conditional branch in the microcode which is set up by the partition
information.

Microprocessor

In the context of this report, the term microprocessor refers to a processing
module which is built out of a set of LSI chips that are used as building
blocks in its construction. Each chip is usually a 2 or 4 bit wide section or
slice of a functional unit which allows cascading to form systems with word
lengths of up to 64 bits. Coordinated control of these chips is through user
microprogramming. The use of this type of microprocessor instead of more
conventional versions, such as Intel's 8080 or Motorola's M6800, is based on
a number of factors.

These  factors  include: 

(1)  high  processing  speed, 

(2)  capability to optimize the system instruction set for a particular
     application, and

(3)  capability to upgrade the system through addition of new features or
     better performance by modification of the microcode.

Some  disadvantages  of  using  these  bit  slice  microprocessors  are: 

(1)  higher  cost, 

(2)  very  little  software  support,  and 

(3)  higher  power  consumption. 

The high processing speed is based on the type of technology used in their
construction and the minimization of gate propagation delays due to the high
packing density of logic gates per chip. In Figure (11), the throughput
capability of various microprocessors using a particular instruction mix is
compared with a missile guidance and control computer system [24] based on
the AM2901A bit slice microprocessor designed by Advanced Micro Devices [25].
The throughput of this system, which is measured in thousands of operations
per second (KOPS), is at least an order of magnitude greater than that of its
nearest competitor.

The following is a list of currently available bit slice microprocessors.
Some of the bit slice microprocessors are organized around a family of
support chips.

Intel 3000 series family [26] - Intel Corporation

MMI 6700 series family [27,28] - Monolithic Memories Incorporated

AM2900 series family [25] - Advanced Micro Devices

SN74S481, SBP0400A and SBP0401A [29] - Texas Instruments

M10800 series family [30] - Motorola

A brief description of each bit slice microprocessor and its associated
microprogram control unit, if it exists, will be given.

The Intel 3002 is the central processing element (CPE) of the family. It is a
two bit slice. Each CPE (Figure (12a)) is organized with five independent
busses, three for input and two for output. A sixth bus is used for control
of the CPE. This control allows over 40 boolean and binary functions to be
performed. The CPE has a cycle time of 150 nanoseconds.

The Intel 3001 (Figure (12b)) is the microprogram control unit (MCU). The
address capabilities of the 3001 are unique. Microprogram addresses are
organized as a two-dimensional array or matrix. A 9 bit address specifies the
row address with the upper 5 bits and the column address with the lower 4
bits. From a particular row or column address, it is possible to jump either
unconditionally to any location in that row or column or conditionally to a
specified subset of locations, in one operation. The MCU has a cycle time of
700 nanoseconds. For a 16 bit system with a pipelined architecture, a
microinstruction cycle time of 150-200 nanoseconds can be obtained.
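
The row and column view of a 9 bit microprogram address can be sketched as
follows. This only illustrates the 5 bit / 4 bit split described above; the
helper names are hypothetical and the sketch does not model the 3001's actual
jump instruction encodings.

```python
def split_mcu_address(addr):
    """Split a 9 bit address into (row, column): upper 5 and lower 4 bits."""
    if not 0 <= addr < 512:
        raise ValueError("address must fit in 9 bits")
    return addr >> 4, addr & 0xF


def row_targets(addr):
    """All addresses that share the row of `addr` (16 columns)."""
    row, _ = split_mcu_address(addr)
    return [(row << 4) | col for col in range(16)]


def column_targets(addr):
    """All addresses that share the column of `addr` (32 rows)."""
    _, col = split_mcu_address(addr)
    return [(r << 4) | col for r in range(32)]


print(split_mcu_address(0b101100111))
print(len(row_targets(0b101100111)), len(column_targets(0b101100111)))
```

The address 101100111 (binary) lies in row 22, column 7; its row holds 16
locations and its column 32.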

The MMI 6701 (Figure (13a)) is a 4 bit Schottky LSI microprocessor slice.
Thirty-six microinstructions are used to control arithmetic, logical, and
shifting operations. Overflow detection is provided. The microprocessor slice
can work with either positive or negative logic. There are 16 directly
addressable, two port, general purpose accumulators with two address
capabilities. The two address capability allows for working on two
accumulators at once. Three register operations are available. The Q register
is used as either a scratchpad or accumulator extension. The MMI 6701 has a
cycle time of 200 nanoseconds.

The MMI 6700 (Figure (13b)) is the microprogram controller (MC) or sequencer.
The MC can be used with RAM, ROM, or PROM and can directly address up to 512
words of control memory. A number of instructions are available for
conditional and subroutine jumps. A multi-way branching capability at each
microinstruction is provided. The MC has, in addition to a single level of
subroutine, a control counter which allows the repetition of
microinstructions. In a system without pipelining, the microinstruction cycle
time is approximately 250 nanoseconds.

In the AM2900 family, there are currently two microprocessor slices
available, the AM2901 and the AM2901A. The AM2901A is a 20-30% faster version
of the AM2901. The AM2901 (Figure (14a)) and the MMI 6701 microprocessor
slices are very similar in organization. Each has the same register and Q
register organization. Both allow two operands to be read from the register
file, operated on in the ALU, shifted, and written back into the register
file during one microinstruction time. In parallel with this, the ALU output
and Q registers can be right shifted. The main advantage of the AM2901 over
the MMI 6701 is speed. It has a cycle time of 105 nanoseconds. The AM2901A
has a cycle time of 70 nanoseconds.

Figure (14a) AM2901 four bit microprocessor slice

The AM2909 and AM2911 are microprogram sequencers (Figure (14b)). They are 4
bit slices which are cascadable. The AM2909 can select an address from four
possible sources. These are a set of external direct inputs (D); external
data from the R input, stored in an internal register; a four word deep
push/pop subroutine stack; or a program counter register. Each of the four
outputs can be OR'ed with an external input for conditional skip or branch
instructions, and a separate line can force the outputs to all zeros.

The AM2911 is identical to the AM2909 except the OR gates are removed and the
D and R inputs are tied together.

The microinstruction cycle time for a system built up in a pipelined fashion
should be approximately 100 nanoseconds.

A new 4 bit microprocessor slice, the AM2903 (Figure (15)), should become
available in November, 1977. The AM2903 will be able to perform the same
functions as the AM2901A and also provide powerful enhancements for use in
arithmetic-oriented processors. It will have an infinitely expandable memory
and a three port, three-address architecture. The AM2903 has built-in
multiplication, division and normalization logic along with parity generation
and sign extension circuitry. This will allow easy implementation of
multiplication, division, normalization of floating point numbers and other
previously time-consuming operations.

The SN74S481 (Figure (16a)) is a 4 bit expandable Schottky microprocessor
slice. Some of its architectural features include parallel dual input/output
ports, full-function ALU with carry look-ahead, magnitude, and overflow
decision capabilities, dual memory address generators, and double-length
accumulator with shifting and sign-bit handling capabilities. Asynchronous
access to data routing and counter updating controls is provided. In a single
microinstruction, it is possible to perform an ALU function with a shift,
select destination with address/iteration updating, plus address and present
data to memory. Pre-programmed multiply, divide, and CRC algorithms are
provided. The microinstruction time is 100 nanoseconds.

Figure (16b) Texas Instruments SBP0400A/SBP0401A

The SBP0400A and SBP0401A (Figure (16b)) are 4 bit expandable microprocessor
slices built out of Integrated Injection Logic (I2L). Each has separate
data-in, data-out, address-out and control ports. Sixteen functions are
provided in the ALU along with a carry look-ahead capability. Some other
features are 8 general registers including a program counter with independent
incrementer, two working registers and shifters with on-chip handling of end
conditions. The major difference between the two microprocessor slices is
that the SBP0400A has an on-chip pipeline operation register while the
SBP0401A is designed for use in an externally pipelined system.

Both processors have a wide performance range. A constant speed-power product
can be obtained over an injector current range covering three orders of
magnitude, with a typical ALU/shift operation taking 240 nanoseconds at 200
mW nominal power.

The M10800 family is built out of Emitter Coupled Logic (ECL). The MC10800
(Figure (17)) is a cascadable 4 bit ALU slice. This chip can perform logic
operations, binary and BCD arithmetic, and both logic and arithmetic
shifting. An internal accumulator is available for temporary storage. A
special mask network allows bit masking of data before arriving at the ALU.
Three independent data ports are provided. Two ports are input/output while
the other is input only. The following arithmetic and status outputs are
provided: overflow, sign, zero, carry out, group propagate, group generate,
parity of carries and parity of results.

The MC10801 (Figure (18)) is used for the microprogram control function. It
is a 4 bit cascadable slice. A status register, instruction register, 4 level
subroutine stack, address register, retry register and incrementer are
included on the chip. Sixteen instructions are available for use in
generating the next control memory address. Some of these instructions
include increment, direct jumps to various inputs and registers,
subroutining, conditional jumps, and a special instruction for multipath
branching.

The MC10803 (Figure (19)) provides the memory interface function. This device
has six registers for use as memory address register, memory data register,
program counter, stack pointer, index register, or other functions. Seventeen
data transfer instructions are provided. Since an ALU is included, memory
address generation under various addressing modes can take place completely
on this chip.

The MC10808 (Figure (20)) is a bit programmable multi-bit shifter. This
device provides a very fast shift network that is essential in floating point
operations for prenormalization or alignment of exponents.

The combination of the above devices into a system can provide
microinstruction times of less than 100 nanoseconds depending on the system
architecture and maximum path delay.

In order to get an estimate of the potential power of a multi-microprocessor
system, it would be useful to compare the processing power of these bit slice
microprocessors to that of a computer like the PDP 11/40 or PDP 11/70. A
study was made at Carnegie Mellon University [31] where a complete equivalent
PDP 11/40 was constructed using Intel's 3000 series bit slice family. The
results of that study indicated that the processing speed of the equivalent
system was 63% of the speed of the PDP 11/40. A careful investigation of the
performance showed a number of pitfalls due to the choice of Intel's 3000
series bit slices in the implementation of the system. If the system had been
rebuilt using the AM2900 series bit slice microprocessor, the integrated
circuit package count could have been reduced from 144 to 95 and the overall
performance boosted to the level of a PDP 11/40.

Figure (19) Motorola MC10803 memory interface unit

Figure (20) Motorola MC10808 16 bit programmable multi-bit shifter

Sequential Controller


The sequential controller is used for program development and coordination of
the multi-microprocessor system. Since its operation is fairly conventional,
it will be a conventional computer system, like the PDP 11/45.

A job control language will be used to specify commands to the various
subsystems of the multi-microprocessor system. Two examples of possible
commands are "DISPLAY IMAGE (.)" and "PARTITION (.)", where (.) represents an
argument list.

The DISPLAY IMAGE command with the appropriate arguments would be sent by the
Sequential Controller to the Memory Management and I/O processor for
decoding. Once decoded, the MMIO would supervise the display of an image with
respect to the given parameters. The PARTITION command would serve as a
global control over the partitioning of the data memory modules,
interconnection network, and instruction memory modules. These are but two of
the many possible commands the system will use.

In addition to its coordination activities, the Sequential Controller is also
used for program generation, compilation, and instruction memory loading.
Loading of the instruction modules takes place through a direct memory access
channel.

The Sequential Controller is not used for any image processing computations,
since it would become a potential bottleneck in the system operation.


A two part investigation has begun to estimate the execution time required
for various image processing tasks. The first part of the study involves
implementation of the tasks on a PDP 11/70 computer and comparing these
results with those obtained for a single bit slice microprocessor system. The
comparison is based on the total time required to complete the task as a
function of the size of the image. Some preliminary results have been
obtained for image processing tasks that operate on local neighborhoods. A
program has been written that spatially filters or smooths image data.
Appendix I contains the program listing, along with a timing equation derived
from the program. Using the instruction set of the previously mentioned
missile guidance computer system, a timing equation was obtained for a
program which accomplishes the same task. Since the basic instructions are
very similar to those of an HP 2116 computer [32] and a program listing of
the same task was available, the problem of developing the timing equation
was simplified. In Appendix I, the instruction list along with the execution
time for each instruction is given. Also included is the program listing
along with the timing equation obtained.

Table (2) shows the processing time required under both systems using various
image sizes. A comparison of these results shows that the system built using
the AM2900 series bit slice microprocessor is approximately 29% slower than
the PDP 11/70. It is estimated that a similar performance efficiency can be
obtained for operations such as edge enhancement, thinning, and other
computations based on local neighborhoods.

The second part of this investigation is to determine how these same
algorithms can be implemented on the multi-microprocessor system. Since the
exact structure of each microprocessor module has not yet been determined,
this work will help in determining a structure which optimizes computation
time.


   Execution Times of Smoothing Program for Square Matrices of Side M

                           Time (Seconds)

        M          PDP 11/70        Microprocessor

       64             .086               .124
      128             .352               .495
      256            1.43               1.98
      512            5.74               7.91
     1024           23.04              31.67

   AM2900 Microprocessor System is 28.6% Slower Than PDP 11/70


Table (2) Comparison of execution times for smoothing program

As discussed in Chapter IV, the ability to compute parallel FFTs and to use
local neighborhood algorithms resulted in the choice of the PM2I
interconnection network for this system.

Investigations into the structure of the microprocessor modules have not
yielded any specific results, and have, in fact, raised many more new
questions. For example, the amount of local storage and the size of the
recirculation buffer must be determined. Is floating point hardware needed?
What type of performance gain is possible if automatic hardware address
generation [33] is used? Finally, perhaps the most important question that
must be answered deals with the type of language support the system should
have. How should control of the interconnection network, active-inactive
status [34] of the microprocessor modules, and data transfers be specified in
this language? Also, since the system will be microprogrammed, should the
language support a dynamic microprogramming capability? Much further research
is needed into these areas before an optimized structure can be developed for
this subsystem.

Chapter VI. Conclusions and Future Work

In this report, we have attempted to show how image processing differs from
conventional computer processing and why a specialized computer architecture
is needed for it. The use of cellular logic arrays for image processing was
discussed. Although cellular logic arrays can be computationally very
powerful, algorithm construction is very difficult, input/output is slow, and
it is only feasible, at the present time, to build small arrays. Another
problem is system reliability. Since these systems have not been designed to
isolate non-functioning parts of the array, if one or more cells are
inoperative, the logic array can produce completely erroneous results.

Another approach to the solution of the image processing computation was
introduced in the remainder of this report. The partitionable
multi-microprogrammable microprocessor system is an attempt, using current
technology, to provide a reliable, flexible, and easy to use system for image
processing.

The overall architecture of this system was heavily influenced by the
computational requirements of various image processing tasks. The system
should attain a high degree of reliability since it is partitionable. If a
failure occurs, the same task can be executed on a smaller partition, by
bypassing the malfunctioning device or devices. It was shown that bit slice
microprocessors are needed to provide sufficient system flexibility and
computational power. In order for this system to be easy to use and
efficient, considerable research is needed in the development of a higher
level language that can be mapped onto the system hardware. In the past, most
computer systems were designed first, with development of a higher level
language as an afterthought. This has proved to be disastrous in the cases of
the Illiac IV and Staran computers. The Burroughs B5700/6700 series computers
[35], on the other hand, are examples of systems whose architecture is
consistent and facilitates the execution of algorithms written in a high
level language, Extended Algol 60. In this system, a similar unified approach
will be taken where the architecture and software are jointly developed.
Further work must be done in determining the computational requirements of
various image processing algorithms and how they can be implemented on this
system. Also, the actions and interactions of the various subsystems must be
simulated to determine global weaknesses or deficiencies in this system.

Appendix I


The execution time [36] for an instruction on the PDP 11/70 depends on the
instruction itself, the modes of addressing used, and the type of memory.

The basic instruction set timing is:

Double Operand

   all instructions,
   except MOV:         Instr. Time = SRC Time + DST Time + EF Time
   MOV instruction:    Instr. Time = SRC Time + EF Time

Single Operand

   all instructions:   Instr. Time = DST Time + EF Time or
                       Instr. Time = SRC Time + EF Time

Branch, Jump, Control, Trap and Misc.

   all instructions:   Instr. Time = EF Time

The following charts are used in determining the execution time for an
instruction.

Single Operand

Instruction (Use with DST Time)

   CLR, COM, INC, DEC, ADC, SBC, ROL, ASL, SWAB, SXT
   NEG
   TST
   ROR, ASR
   ASH, ASHC

NOTE (H): Add 0.15 µsec if odd byte.
NOTE (I): Add 0.15 µsec per shift.
NOTE (J): Add 0.30 µsec if DST is R7.


Instruction                 EF Time     Read Memory
(Use with SRC Times)        (µsec)      Cycles

MUL                          3.30           1
DIV   by zero                 .90           1
      shortest               7.05           1
      longest                8.55           1

Instruction                 EF Time     Read Memory
(Use with SRC Times)        (µsec)      Cycles

MFPI                         1.50           1
MFPD                         1.50           1


                  DST       Instruction Time     Read Memory
Instruction       Mode      (µsec)               Cycles

MTPI, MTPD         0            .90                  1
                   1           1.65                  2
                   2           1.65                  2
                   3           2.10                  3
                   4           1.80                  2
                   5           2.25                  3
                   6           2.10                  3
                   7           2.55                  4

Branch Instructions

                            Instr Time   Instr Time    Read Memory
Instruction                 (Branch)     (No Branch)   Cycles

BR, BNE, BEQ,
BPL, BMI, BVC,
BVS, BCC, BCS,
BGE, BLT, BGT,
BLE, BHI, BLOS,
BHIS, BLO                     .60          .30            1

SOB                           .60          .75            1

Jump Instructions

                  DST       Instr Time     Read Memory
Instruction       Mode      (µsec)         Cycles

JMP                1           .90
                   2           .90
                   3          1.20
                   4           .90
                   5          1.35


Control, Trap & Miscellaneous Instructions

                            Instr Time     Read Memory
Instruction                 (µsec)         Cycles

RTS                          1.05              2
MARK                          .90              2
RTI, RTT                     1.50              3
SET N, Z, V, C                .60              1
CLR N, Z, V, C                .60              1
HALT                         1.05              0
WAIT                          .45              0
   (WAIT loop for a BR is .3 µsec.)
RESET                        10 ms             1
IOT, EMT, TRAP, BPT          3.30              3
SPL                           .60              1
INTERRUPT (first device)     2.31              2
The following assembly language program was written for the
PDP 11/70 computer to spatially filter or smooth data in an M x N
array. The total execution time T is equal to the sum of t_INIT,
t_a1, t_b1, and t_c1, where:

   t_INIT = 25.95 microseconds
   t_a1   = 1.65 m - .9 microseconds
   t_b1   = 1.65 mn - 3.3 n + 4.35 m - 8.7 microseconds
   t_c1   = 20.4 (m-2)(n-2) microseconds

   T = 22.05 mn - 44.1 n - 34.8 m + 97.95 microseconds
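
As a rough check of the timing expression, the four component times can be
summed directly. The sketch below is an illustration only: the function name
filter_time_us is hypothetical, and the factor m in t_a1 is an assumption
(the printed term is incomplete; m is the choice that reproduces the -34.8 m
term of the total).

```python
def filter_time_us(m, n):
    """Estimated PDP 11/70 run time, in microseconds, for an m x n array."""
    t_init = 25.95                                  # one-time set-up code
    t_a1 = 1.65 * m - 0.9                           # row test (factor m assumed)
    t_b1 = 1.65 * m * n - 3.3 * n + 4.35 * m - 8.7  # column test and row advance
    t_c1 = 20.4 * (m - 2) * (n - 2)                 # 3 x 3 sum on interior points
    return t_init + t_a1 + t_b1 + t_c1


# Example: a 512 x 512 picture takes roughly 5.74 seconds.
seconds_512 = filter_time_us(512, 512) / 1e6
```

The interior-point term t_c1 dominates for any picture of practical size, so
the run time grows essentially as 22.05 mn microseconds.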


FILTER  macro v1AUG75  18-nov-77  02:55  page 1

                                        .title  filter
000000  016567  000002  000106  filter: mov     2(r5),xaddr
000006  016567  000004  000102          mov     4(r5),yaddr
000014  017567  000006  000066          mov     @6(r5),m
000022  017567  000010  000062          mov     @10(r5),n
000030  017567  000010  000050          mov     @10(r5),incr
000036  006367  000044                  asl     incr
000042  066767  000040  000044          add     incr,xaddr
000050  062767  000002  000036          add     #2,xaddr
000056  066767  000024  000032          add     incr,yaddr
000064  062767  000002  000024          add     #2,yaddr
000072  012703  000002                  mov     #2,r3
000076  012704  000002                  mov     #2,r4
000102  000167  000014                  jmp     a1
000106  000000                  incr:   .word   0
000110  000000                  m:      .word   0
000112  000000                  n:      .word   0
000114  000000                  xaddr:  .word   0
000116  000000                  yaddr:  .word   0
000120  000000                  temp:   .word   0
000122  026703  177762          a1:     cmp     m,r3            ;last row?
000126  001001                          bne     b1              ;no
000130  000207                          rts     pc              ;return from subroutine
000132  026704  177754          b1:     cmp     n,r4            ;last col?
000136  001013                          bne     c1              ;no
000140  012704  000002                  mov     #2,r4
000144  005203                          inc     r3              ;set address
000146  062767  000004  177740          add     #4,xaddr        ;of pic1(i+1,2)
000154  062767  000004  177734          add     #4,yaddr        ;and pic2(i+1,2)
000162  000167  177734                  jmp     a1
                                ;+
                                ; spatial filter on interior points
                                ; of the picture matrix using 3x3
                                ;-
000166  016702  177722          c1:     mov     xaddr,r2
000172  016201  177776                  mov     -2(r2),r1       ;pic1(i-1,j)
000176  066201  000002                  add     2(r2),r1        ;+ pic1(i+1,j)
000202  166702  177700                  sub     incr,r2
000206  061201                          add     @r2,r1          ;+ pic1(i,j-1)
000210  066201  177776                  add     -2(r2),r1       ;+ pic1(i-1,j-1)
000214  066201  000002                  add     2(r2),r1        ;+ pic1(i+1,j-1)
000220  016702  177670                  mov     xaddr,r2
000224  066702  177656                  add     incr,r2
000230  061201                          add     @r2,r1          ;+ pic1(i,j+1)
000232  066201  177776                  add     -2(r2),r1       ;+ pic1(i-1,j+1)
000236  066201  000002                  add     2(r2),r1        ;+ pic1(i+1,j+1)
000242  006201                          asr     r1
000244  006201                          asr     r1
000246  006201                          asr     r1
000250  010177  177642                  mov     r1,@yaddr
000254  062767  000002  177634          add     #2,yaddr
000262  062767  000002  177624          add     #2,xaddr
000270  005204                          inc     r4
000272  000167  177634                  jmp     b1
        000000'                         .end    filter
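
For readers who do not follow PDP-11 assembly, the operation that the listing
appears to perform can be sketched in Python. This is an illustration under
stated assumptions, not a translation of the program: the function name smooth
and the list-of-lists picture representation are hypothetical, and copying the
border values into the output is an assumption (the listing simply does not
write to the border points of the output array).

```python
def smooth(pic1):
    """Return a picture whose interior points hold the 8-neighbour average."""
    rows = len(pic1)
    cols = len(pic1[0])
    pic2 = [row[:] for row in pic1]        # assumption: border values are copied
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            total = 0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    if di or dj:           # the centre point is not added
                        total += pic1[i + di][j + dj]
            pic2[i][j] = total >> 3        # three right shifts: divide by 8
    return pic2


# Example: smooth a small test picture with one bright centre point.
smoothed = smooth([[1, 2, 3], [4, 100, 6], [7, 8, 9]])
```

Because only the eight neighbours are summed, the divisor is a power of two
and the division reduces to three arithmetic shifts, as in the listing.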


The execution times of instructions for the microcomputer system
built using the AM 2900 series bit slice family are shown below. These
instruction times assume a data and program memory cycle of 375
nanoseconds.


Microcomputer Instruction Set

Register-to-Register Instructions

   Add, Subtract, Negate, Transfer,
   Clear, Increment, Decrement, Complement     0.55 µs
   Logical Shift, Arithmetic Shift             [illegible]
   Skip on Flags                               0.70 µs
   Input, Output                               1.3 µs
   No Operation                                0.55 µs

Memory Reference Instructions

   [illegible]                                 0.05 µs [illegible] 0.15 µs
   Multiply                                    3.55 µs
   Divide                                      [illegible]
   Compare                                     0.95 µs
   Unit                                        1.3 µs
   Memory Increment, Memory Decrement          1.1 µs
   AND, OR, Exclusive OR                       [illegible]
   Branch Unconditionally                      0.55 µs
   Branch to Subroutine                        0.70 µs


The following program, written in HP assembler language, is for smoothing
an M x N array. The HP instructions correspond very closely to the
microcomputer instruction set. Therefore, a translation was made between the
instruction sets. Two assumptions were made. The first assumption is
that each level of indirect addressing adds .5 microseconds. The second is
that the HP "ISZ" instruction corresponds to the combination of the following
three microcomputer instructions: "Increment," "Compare," and "Skip."

The program was analyzed and the following timing equation was obtained.

T = 30.2 mn - 60 (m-n) + 100 microseconds.


THIS IS A PROGRAM IN HP ASSEMBLER LANGUAGE FOR SMOOTHING A
PICTURE DIGITIZED IN AN NxN MATRIX.

IT IS ASSUMED THAT THE ELEMENTS OF THE INPUT DIGITAL PICTURE
ARE STORED IN A VECTOR FROM TOP-LEFT TO BOTTOM-RIGHT,
COLUMN BY COLUMN (E.G. THE ELEMENT A(1,1) OCCUPIES THE
ADDRESS 1, THE ELEMENT A(1,2) THE ADDRESS N+1, THE ELEMENT
A(2,1) THE ADDRESS 2 ETC.)
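
The storage convention described above can be sketched in Python as follows.
The function names address and neighbour_addresses are hypothetical and are
used only to illustrate the column-by-column layout; they do not appear in the
HP program.

```python
def address(i, j, n):
    """1-based address of element A(i, j) of an n x n picture stored
    column by column, so that A(1,1) is at address 1 and A(1,2) at n+1."""
    return (j - 1) * n + i


def neighbour_addresses(i, j, n):
    """Addresses of the eight neighbours of an interior element A(i, j)."""
    return [address(i + di, j + dj, n)
            for dj in (-1, 0, 1)
            for di in (-1, 0, 1)
            if di or dj]


# Example: neighbours of A(2,2) in a 10 x 10 picture.
around_2_2 = neighbour_addresses(2, 2, 10)
```

With this layout, moving one place along the first index changes the address
by 1, and moving one place along the second index changes it by N.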


NATf  PENO.P 

ENT  PEHO 

EXT  .ENTR 

NOP 

NOP 

NOP 


Mir 

JSB 

.ENTR 

MIS 

D£F 

C 

MIS 

A 

DBM 

LDS 

D 

MSI 

AD8 

KPl 

M22 

STB 

Dl 

DD23 

LDB 

El 

De24 

STB 

K 

BBSS 

LDB 

C 

asM 

A06 

NPI 

BB22 

STB 

Cl 

esas 

A 

M2S 

• 

HM 

LU 

ADB 

Nhi 

MSI 

A 

IT7F1 

IDA 

1N8 

I.X 

M34 

ADA 

l.l 

aass 

INB 

sess 

ADA 

l.I  - 

BBSF 

ADB 

m 

BB3B 

ADA 

t.i 

tS39 

ADB 

m 

W48 

ADA 

l.l 

aa4i 

ADB 

f« 

sa42 

MDA 

i.t 

BB43 

ADB 

m 

BB44 

.:da 

l.l 

M49 

ADB 

N 

DBAS 

BBAF 

A 

ADA 

t.l 

BB4B 

BB4B 

A 

AR8 

BB9B 

ARS 

■m 

ARS 

MU 

A 

sass 

SBS4 

A 

STA 

SI. I 

BBSS 

A 

BBSS 

BBSF 

A 

IS2 

K 

MM 

jrp 

LDI 

HS9 

LDB 

Kl 

aSM 

STS  K 

BBGt 

A 

aM2 

A 

SMS 

A 

asM 

A 

SMS 

A 

aaM 

A 

ssar 

LM 

Dl 

asM 

ADB 

Dl 

eM9 

STS 

Dl 

BB7B 

1 no 

C! 

•BFl 

ADB 

OJ 

Mra 

SIB 

Cl 

Mn 

1S2 

H 

far4 

LD2 

Sra 

jrr 

AENO. 

Sara 

A 

aarr 

A 

A 

BB>S 

A 

RAMA 

LSI 

IS2 

Dl 

BBBI 

LDB 

Cl 

asai 

INB 

aMi 

STS 

Cl 

BSB4 

A 

iw 

L92 

aaM 

A 

SMr 

NP| 

DCC 

fl 

aaM 

IV11 

DEC 

9 

SSM 

m 

DCC 

-1 

AAAA 

•1 

DEC 

J 

aMi 

ai 

DEC 

-B 

asM 

K 

KC 

B 

SMI 

M 

DCC 

-a 

asM 

M 

DEC 

tB 

saas 

m 

DEC 

-IB 

saw 

ai 

NOP 

saar 

Cl 

NOP 

saM 

END 

SSM 

ENDB 

LIIT  CMB 

INSTRUCTIONS NEEDED FOR TRANSFERRING PARAMETERS
FROM FORTRAN MAIN PROGRAM TO ASSEMBLER
SUBROUTINE (TWO MATRICES ARE TRANSFERRED: C AND D)

INSTRUCTIONS FOR PREPARING REGISTERS.
THEY ARE OUT OF LOOPS.
SOME INSTRUCTIONS HERE AND ALONG THE
PROGRAM ARE NEEDED BECAUSE SMOOTHING IS NOT
APPLIED TO THE 4(N-1) BORDER ELEMENTS
OF THE INPUT MATRIX.
THE ELEMENTS TO WHICH THE ALGORITHM IS
APPLIED WILL BE CALLED "ELEMENTS OF INTEREST"

GO TO THE ADDRESS OF THE ELEMENT PLACED
AT NORTH-EAST OF THE CONSIDERED ONE.
STORE ITS VALUE.
GO TO THE EAST ELEMENT.
ADD THE VALUES OF THE EAST AND NORTH-EAST ELEMENTS

SCAN IN SUCCESSION THE REMAINING 8-NEIGHBOURS OF
THE CONSIDERED ELEMENT AND ADD THEIR VALUES
TOGETHER.

SHIFT 3 TIMES TOWARDS RIGHT THE BINARY VALUE OF
THE SUM JUST OBTAINED, I.E. DIVIDE BY 8.

RELABEL THE CONSIDERED ELEMENT.

INCREMENT K AND IF K=0 SKIP NEXT INSTRUCTION AND
RESET K TO K1, OTHERWISE JUMP TO LD1.
THESE INSTRUCTIONS ARE REQUIRED TO SKIP THE
2*(N-2) ELEMENTS BELONGING TO THE FIRST AND
LAST ROW OF THE INPUT MATRIX.

IF K=0 THE "ELEMENTS OF INTEREST" OF ONE COLUMN
OF THE INPUT MATRIX (I.E. THE VECTOR'S ELEMENTS
FROM THE 2.ND TO THE (N-1).TH OR THE ONES FROM THE
(2N+2).TH TO THE (3N-1).TH ETC.) HAVE ALL BEEN
CONSIDERED. TO INITIATE THE SCANNING OF THE NEXT
COLUMN, THE CURRENT ELEMENT ADDRESS HAS TO BE
INCREMENTED BY 3. A TEST IS ALSO PERFORMED TO
ASCERTAIN WHETHER SUCH AN INCREMENTATION HAS BEEN
MADE (N-2) TIMES (IN THIS CASE THE MATRIX HAS BEEN
COMPLETELY SCANNED) OR NOT. ACCORDING TO THE
RESULT OF THE TEST, THE PROGRAM JUMPS TO THE
SUBROUTINE PEND WHICH RETURNS TO THE FORTRAN
MAIN PROGRAM "PRINT AND END", OR GOES TO LD2.

GO TO THE NEXT ELEMENT OF THE VECTOR AND
APPLY TO IT THE ROUTINE LD2

INSTRUCTIONS DEFINING THE QUANTITIES USED IN THE
PROGRAM. THEY ARE OUT OF THE LOOPS.
THE NUMERICAL VALUES GIVEN ON THE LEFT ARE
DEDUCED FROM THE FOLLOWING FORMULAS, FOR N=10.

NP1=N+1     NM1=N-1     M1=-1
D3=3        E1=-(N-2)   E=0
H=-(N-2)    [illegible]


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